Abstract

There are numerous examples of problems in symbolic algebra in which the required storage 
grows far beyond the limitations even of the distributed RAM of a cluster. Often this limitation 
determines how large a problem one can solve in practice. Roomy provides a minimally invasive 
system to modify the code for such a computation, in order to use the local disks of a cluster 
or a SAN as a transparent extension of RAM. 

Roomy is implemented as a C/C++ library. It provides some simple data structures (arrays, 
unordered lists, and hash tables). Some typical programming constructs that one might employ 
in Roomy are: map, reduce, duplicate elimination, chain reduction, pair reduction, and
breadth-first search. All aspects of parallelism and remote I/O are hidden within the Roomy library.

1 Introduction 

This paper provides a brief introduction to Roomy [1], a new programming model and open source 
library for parallel disk-based computation. The primary purpose of Roomy is to solve space limited 
problems without significantly increasing hardware costs or radically altering existing algorithms 
and data structures. 

Roomy uses disks as the main working memory of a computation, instead of RAM. These disks
can be disks attached to a single shared-memory system, a storage area network (SAN), or the
locally attached disks of a compute cluster. Particularly in the case of using the local disks of
a cluster, disks are often underutilized and can provide several orders of magnitude more working
memory than RAM for essentially no extra cost.

There are two fundamental challenges in using disk-based storage as main memory: 

Bandwidth: roughly, the bandwidth of a single disk is 50 times less than that of a single 
RAM subsystem (100 MB/s versus 5 GB/s). The solution is to use many disks in parallel, 
achieving an aggregate bandwidth comparable to RAM. 

Latency: even worse than bandwidth, the latency of disk is many orders of magnitude worse 
than RAM. The solution is to avoid latency penalties by using streaming data access, instead 
of costly random access. 
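
The bandwidth figures above imply a simple estimate of how many disks must stream in parallel to match one RAM subsystem. The sketch below is only that arithmetic; the function name `disks_needed` is hypothetical, and it assumes that bandwidth adds up linearly across disks, which is an idealization.

```c
#include <stdint.h>

/* Smallest number of disks whose combined streaming bandwidth reaches the
   target. Assumes aggregate bandwidth scales linearly with the disk count.
   Returns 0 if the per-disk bandwidth is given as 0. */
uint64_t disks_needed(uint64_t target_mb_per_s, uint64_t disk_mb_per_s) {
    if (disk_mb_per_s == 0) return 0;
    return (target_mb_per_s + disk_mb_per_s - 1) / disk_mb_per_s; /* ceiling */
}
```

With the figures quoted above (5 GB/s taken as 5000 MB/s, and 100 MB/s per disk), `disks_needed(5000, 100)` evaluates to 50.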

Roomy hides from the programmer both the complexity inherent in parallelism and the
techniques needed to convert random access patterns into streaming access patterns. In doing so, the
programming model presented to the user closely resembles that of traditional RAM-based serial 
computation. 
The rest of this paper briefly describes the data structures provided by Roomy, and some
example programming constructs that can be implemented using these data structures. Complete 
documentation, and instructions for obtaining the Roomy open source library, can be found on the 
Web at roomy.sourceforge.net. 

2 Roomy Data Structures 

Roomy data structures are transparently distributed across many disks, and the operations on 
these data structures are transparently parallelized across the many compute nodes of a cluster. 
Currently, there are three Roomy data structures: 

RoomyArray: a fixed size, indexed array of elements (elements can be as small as one bit). 
RoomyHashTable: a dynamically sized structure mapping keys to values. 
RoomyList: a dynamically sized, unordered list of elements. 

There are two types of Roomy operations: delayed and immediate. If an operation requires 
random access, it is delayed. Otherwise, it is performed immediately. To initiate the processing 
of delayed operations for a given Roomy data structure, the programmer makes an explicit call to 
synchronize that data structure. Delaying random access operations allows them to be collected
and performed more efficiently as a batch.
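
The idea of delaying random access operations and processing them in a batch can be modeled in RAM. The following is a minimal sketch, not Roomy's implementation: the types `DelayedOp` and `DelayedArray` and the functions `delayed_update` and `delayed_sync` are hypothetical names, and an in-memory array stands in for disk-resident data. Updates are only queued until the sync call, which sorts them by index so that the data is touched in one forward pass.

```c
#include <stdlib.h>

/* One delayed update: add `delta` to the element at `index`. */
typedef struct { size_t index; int delta; } DelayedOp;

/* Hypothetical in-RAM stand-in for a disk-resident array plus its queue. */
typedef struct {
    int *data; size_t len;
    DelayedOp *ops; size_t n_ops, cap;
} DelayedArray;

static int cmp_op(const void *a, const void *b) {
    size_t x = ((const DelayedOp *)a)->index;
    size_t y = ((const DelayedOp *)b)->index;
    return (x > y) - (x < y);
}

/* Queue an update; `data` is not modified yet. Returns 0 on success. */
int delayed_update(DelayedArray *a, size_t index, int delta) {
    if (index >= a->len) return -1;
    if (a->n_ops == a->cap) {
        size_t cap = a->cap ? 2 * a->cap : 16;
        DelayedOp *p = realloc(a->ops, cap * sizeof *p);
        if (!p) return -1;
        a->ops = p;
        a->cap = cap;
    }
    a->ops[a->n_ops++] = (DelayedOp){ index, delta };
    return 0;
}

/* Apply all queued updates in increasing index order, then clear the queue. */
void delayed_sync(DelayedArray *a) {
    if (a->n_ops == 0) return;
    qsort(a->ops, a->n_ops, sizeof *a->ops, cmp_op);
    for (size_t k = 0; k < a->n_ops; k++)
        a->data[a->ops[k].index] += a->ops[k].delta;
    a->n_ops = 0;
}
```

The more updates are queued before `delayed_sync` is called, the more work each pass over the data accomplishes, which is the reason for collecting them.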

Table 1 describes some of the basic Roomy operations. Some operations are specific to one 
type of Roomy data structure, while others apply to all three. The operations are also identified 
as either immediate (I) or delayed (D). 

For performance reasons, it is often best to use a RoomyArray or RoomyHashTable instead of 
a RoomyList, where possible. Computations using RoomyLists are often dominated by the time 
to sort the list and any delayed operations. RoomyArrays and RoomyHashTables avoid sorting by 
organizing data into buckets, based on indices or keys. 
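
To illustrate why bucketing can replace sorting, the sketch below groups a batch of array indices by the bucket that owns them, using counting rather than comparisons. It is an assumption about how such bucketing might look, not a description of Roomy's internals; the name `group_by_bucket` and the fixed-width bucket layout are hypothetical.

```c
#include <stddef.h>

/* Group n indices by bucket, where bucket b owns indices
   [b * bucket_size, (b + 1) * bucket_size).
   Precondition: bucket_size > 0 and every idx[k] < bucket_size * n_buckets.
   On return, out holds the indices grouped by bucket, and end[b] is the
   position one past the last entry of bucket b in out. */
void group_by_bucket(const size_t *idx, size_t n, size_t bucket_size,
                     size_t n_buckets, size_t *end, size_t *out) {
    for (size_t b = 0; b < n_buckets; b++) end[b] = 0;
    for (size_t k = 0; k < n; k++) end[idx[k] / bucket_size]++;  /* counts */

    size_t sum = 0;
    for (size_t b = 0; b < n_buckets; b++) {  /* counts -> start offsets */
        size_t count = end[b];
        end[b] = sum;
        sum += count;
    }
    /* Place each index at its bucket's cursor; cursors finish at bucket ends. */
    for (size_t k = 0; k < n; k++) out[end[idx[k] / bucket_size]++] = idx[k];
}
```

Each index is assigned to its bucket directly from its value, so the cost is linear in the number of operations instead of the n log n of a comparison sort.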

3 Programming Constructs 

Because Roomy provides data structures and operations similar to traditional programming models, 
many common programming constructs can be implemented in Roomy without significant
modification. The one major difference is in the use of delayed random operations. To ensure efficient
computation, it is important to maximize the number of delayed random operations issued before 
they are executed (by calling sync on the data structure). 

Below are Roomy implementations of six programming constructs: map, reduce, set operations, 
chain reduction, pair reduction, and breadth-first search. Both map and reduce are primitive 
operations in Roomy. The others are built using Roomy primitives. 
Table 1: Some basic Roomy operations. If an operation is specific to one type of data structure,
it is listed under RoomyArray, RoomyHashTable, or RoomyList. Otherwise, it is listed as "common
to all". Also, the type of each operation is given as either immediate (I) or delayed (D).

Data Structure   Name            Type  Description
---------------  --------------  ----  ------------------------------------------------------
RoomyArray       access          D     apply a user-defined function to an element
                 update          D     update an element using a user-defined function
RoomyHashTable   insert          D     insert a given (key, value) pair in the table
                 remove          D     given a key, remove the corresponding (key, value)
                                       pair from the table
                 access          D     given a key, apply a user-defined function to the
                                       corresponding value
                 update          D     given a key, update the corresponding value using a
                                       user-defined function
RoomyList        add             D     add a single element to the list
                 remove          D     remove all occurrences of a single element from the
                                       list
                 addAll          I     adds all elements from one list to another
                 removeAll       I     removes all elements in one list from another
                 removeDupes     I     removes duplicate elements from a list
Common to all    sync            I     process all outstanding delayed operations for the
                                       data structure
                 size            I     returns the number of elements in the data structure
                 map             I     applies a user-defined function to each element
                 reduce          I     applies a user-defined function to each element and
                                       returns a value (e.g. the ten largest elements of the
                                       list)
                 predicateCount  I     returns the number of elements that satisfy a given
                                       property (Note: this does not require a separate scan,
                                       the count is kept current as the data is modified)

First, note that the code given here uses a simplified syntax. For example, the doUpdate method
from the chain reduction programming construct below would be implemented in Roomy as: 

void doUpdate(uint64 localIndex, void* localVal,
              void* remoteVal) {
    *(int*)localVal =
        *(int*)localVal + *(int*)remoteVal;
}

The simplified version given here eliminates the type casting, and appears as: 

int doUpdate(int localIndex, int localVal,
             int remoteVal) {
    return localVal + remoteVal;
}

A future C++ version of Roomy is planned that would use templates to make the simplified 
version legal code. 

See the online Roomy documentation and API [1] for the exact syntax and function definitions. 

Map The map operator applies a user-defined function to every element of a Roomy data structure. 
As an example, the following converts a RoomyArray into a RoomyHashTable, with array indices as 
keys and the associated elements as values. 

RoomyArray ra;       // elements of type T
RoomyHashTable rht;  // pairs of type (int, T)

// Function to map over RoomyArray ra.
void makePair(int i, T element) {
    RoomyHashTable_insert(rht, i, element);
}

// Perform map, then complete delayed inserts
RoomyArray_map(ra, makePair);
RoomyHashTable_sync(rht);



Reduce The reduce operator produces a result based on a combination of all elements in a data 
structure. It requires two user-defined functions. The first combines a partially computed result 
and an element of the list. The second combines two partially computed results. The order of 
reductions is not guaranteed. Hence, these functions must be associative and commutative, or else 
the result is undefined. 

As an example, the following computes the sum of squares of the elements in a RoomyList.

RoomyList rl; // elements of type int

// Function to add square of an element to sum.
int mergeElt(int sum, int element) {
    return sum + element * element;
}

// Function to compute sum of two partial answers.
int mergeResults(int sum1, int sum2) {
    return sum1 + sum2;
}

int sum =
    RoomyList_reduce(rl, mergeElt, mergeResults);

The type of the result does not necessarily have to be the same as the type of the elements in 
the list, as it is in this case. For example, the result could be the k largest elements of the list. 
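
The need for two functions becomes clearer when the data is split into partitions that are reduced separately. The following in-RAM sketch is an illustration under that assumption, not Roomy's implementation; `partitioned_reduce`, `add_square`, and `add_sums` are hypothetical names, and the caller supplies an identity value for the partial results.

```c
#include <stddef.h>

typedef long (*MergeElt)(long partial, int element);
typedef long (*MergeResults)(long a, long b);

/* Reduce n elements split into `parts` contiguous partitions. Each partition
   is folded with merge_elt starting from `identity`; the partial results are
   then combined with merge_results. */
long partitioned_reduce(const int *elts, size_t n, size_t parts, long identity,
                        MergeElt merge_elt, MergeResults merge_results) {
    long total = identity;
    if (parts == 0) parts = 1;
    for (size_t p = 0; p < parts; p++) {
        size_t lo = n * p / parts, hi = n * (p + 1) / parts;
        long partial = identity;
        for (size_t i = lo; i < hi; i++)
            partial = merge_elt(partial, elts[i]);
        total = merge_results(total, partial);
    }
    return total;
}

/* Sum of squares, matching the example in the text. */
long add_square(long sum, int element) { return sum + (long)element * element; }
long add_sums(long a, long b) { return a + b; }
```

Because addition is associative and commutative, the sum of squares comes out the same for any number of partitions; a function without those properties could give a result that depends on how the data happened to be split.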

Set Operations Roomy can support certain set operations through the use of a RoomyList. 
Some of these operations (particularly intersection) are sub-optimal when built using the current 
set of primitives. Future work is planned to add a native RoomySet data structure. 
A RoomyList can be converted to a set by removing duplicates. 

RoomyList A; // can contain duplicate elements
RoomyList_removeDupes(A); // now a set

Performing set union, A = A ∪ B, is also simple.

RoomyList A, B;
RoomyList_addAll(A, B);
RoomyList_removeDupes(A);

Set difference, A = A - B, is performed by using just the removeAll operation, assuming A
and B are already sets.

RoomyList A, B; 
RoomyList_removeAll (A, B); 

Finally, set intersection is implemented as a union, followed by set differences:
C = (A ∪ B) - (A - B) - (B - A). Set intersection may become a Roomy primitive in the future.

// input sets
RoomyList A, B;

// initially empty sets
RoomyList AandB, AminusB, BminusA, C;

// create three temporary sets
RoomyList_addAll(AandB, A);
RoomyList_addAll(AandB, B);
RoomyList_removeDupes(AandB);
RoomyList_addAll(AminusB, A);
RoomyList_removeAll(AminusB, B);
RoomyList_addAll(BminusA, B);
RoomyList_removeAll(BminusA, A);

// compute intersection
RoomyList_addAll(C, AandB);
RoomyList_removeAll(C, AminusB);
RoomyList_removeAll(C, BminusA);
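
The identity behind this construction can be checked on small sets. The sketch below represents a set as a 64-bit mask, which is purely an illustrative choice and unrelated to how a RoomyList stores elements; the type `Set` and the three function names are hypothetical.

```c
#include <stdint.h>

typedef uint64_t Set;  /* bit e is set when element e (0..63) is a member */

Set set_union(Set a, Set b)      { return a | b; }
Set set_difference(Set a, Set b) { return a & ~b; }

/* Intersection built only from union and difference, following the same
   three steps as the list-based code: start from the union, then remove
   the elements found only in a and the elements found only in b. */
Set intersection_via_difference(Set a, Set b) {
    Set c = set_union(a, b);
    c = set_difference(c, set_difference(a, b));
    c = set_difference(c, set_difference(b, a));
    return c;
}
```

For any two masks, the result equals `a & b`: what remains of the union after discarding the two one-sided differences is exactly the set of common elements.
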
Chain Reduction Chain reduction combines each element in a sequence with the element after
it. In this example, we compute the following function for an array of integers a of length N

for i = 1 to N-1
    a[i] = a[i] + a[i-1]

where all array elements on the right-hand side are accessed before updating any array elements 
on the left-hand side. 

In the following code, val_i represents a[i] and
val_iMinus1 represents a[i-1].

RoomyArray ra; // array of ints, length N

// Function to complete updates
int doUpdate(int i, int val_i, int val_iMinus1) {
    return val_i + val_iMinus1;
}

// Function to be mapped over ra, issues updates
void callUpdate(int iMinus1, int val_iMinus1) {
    int i = iMinus1 + 1;
    if (i < N)
        RoomyArray_update(
            ra, i, val_iMinus1, doUpdate);
}

RoomyArray_map(ra, callUpdate); // issue updates
RoomyArray_sync(ra);            // complete updates

The computation is deterministic. The new array values will be based only on the old array 
values because Roomy guarantees that none of the delayed update operations will be executed 
until sync is called. The code above is implemented internally through a traditional scatter-gather 
operation. 
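
The effect of delaying the updates can be reproduced in RAM by separating the step that records values from the step that applies them. This is a minimal sketch of that behavior, assuming an ordinary in-memory int array; `chain_reduce` is a hypothetical name and the code does not use the Roomy library.

```c
#include <stdlib.h>

/* One chain-reduction step: a[i] += a[i-1] for 1 <= i < n, where every
   right-hand side is read before any element is written.
   Returns 0 on success, -1 if memory could not be allocated. */
int chain_reduce(int *a, size_t n) {
    if (n < 2) return 0;
    int *pending = malloc((n - 1) * sizeof *pending);
    if (!pending) return -1;

    /* "map" phase: record the old value that element i will receive */
    for (size_t i = 1; i < n; i++) pending[i - 1] = a[i - 1];

    /* "sync" phase: apply the recorded updates */
    for (size_t i = 1; i < n; i++) a[i] += pending[i - 1];

    free(pending);
    return 0;
}
```

For the input {1, 2, 3, 4} this yields {1, 3, 5, 7}. A single in-place loop that read a[i-1] after it had already been updated would instead produce the running sums {1, 3, 6, 10}.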

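Parallel Prefix The chain reduction programming construct can also be used as the basis for a
parallel prefix computation. At a high level, the parallel prefix computation is defined as

for (k = 1; k < N; k = k * 2)
    if i-k >= 0
        a[i] = a[i] + a[i-k];

where, for each value of k, the update is applied to every index i and, as in chain reduction,
all right-hand sides are read before any element is updated.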
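
A RAM-only sketch of this doubling scheme is given below, assuming addition as the operator and a snapshot buffer to stand in for the guarantee that updates see only old values. The name `prefix_sum` is hypothetical and the code does not call Roomy.

```c
#include <stdlib.h>
#include <string.h>

/* Inclusive prefix sum by repeated chain reduction with stride k = 1, 2, 4, ...
   Each round reads only the values from before the round (the snapshot).
   Returns 0 on success, -1 if memory could not be allocated. */
int prefix_sum(int *a, size_t n) {
    if (n < 2) return 0;
    int *old = malloc(n * sizeof *old);
    if (!old) return -1;
    for (size_t k = 1; k < n; k *= 2) {
        memcpy(old, a, n * sizeof *a);   /* values before this round */
        for (size_t i = k; i < n; i++)   /* i >= k is the test i - k >= 0 */
            a[i] = old[i] + old[i - k];
    }
    free(old);
    return 0;
}
```

For {1, 2, 3, 4, 5} the rounds produce {1, 3, 5, 7, 9}, then {1, 3, 6, 10, 14}, then {1, 3, 6, 10, 15}, so the number of rounds grows with log N rather than N.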



Pair Reduction Pair reduction applies a function to each pair of elements in a collection. For
an array a of length N, pair reduction is defined as

for i = 0 to N-1
    for j = 0 to N-1
        f(a[i], a[j]);

The following example inserts each pair of elements from a RoomyArray into a RoomyList. The 
variable outerVal represents a[i] and the variable innerVal represents a[j].
RoomyArray ra; // array of int, length N
RoomyList rl;  // list containing Pair(int, int)

// Access function, adds a pair to the list
void doAccess(int innerIndex, int innerVal,
              int outerVal) {
    RoomyList_add(
        rl, new Pair(innerVal, outerVal));
}

// Map function, sends access to all other elts
void callAccess(int outerIndex, int outerVal) {
    for (int innerIndex = 0; innerIndex < N; innerIndex++)
        RoomyArray_access(
            ra, innerIndex, outerVal, doAccess);
}

RoomyArray_map(ra, callAccess);
RoomyArray_sync(ra); // perform delayed accesses
RoomyList_sync(rl);  // perform delayed adds

One can think of the RoomyArray_map method as the outer loop, the callAccess method as
the inner loop, and the doAccess method as the function being applied to each pair of elements.
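
For comparison, the same correspondence can be written as a plain serial loop nest over an in-memory array. The sketch below is only a RAM-based reference for what the construct computes; the `Pair` struct and the function `all_pairs` are hypothetical, and the caller is assumed to provide room for n * n pairs.

```c
#include <stddef.h>

typedef struct { int inner, outer; } Pair;

/* Writes all n * n ordered pairs into out and returns how many were written.
   out must have room for n * n entries. */
size_t all_pairs(const int *a, size_t n, Pair *out) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)       /* outer loop: the map over the array */
        for (size_t j = 0; j < n; j++)   /* inner loop: one access per element */
            out[count++] = (Pair){ a[j], a[i] };  /* (innerVal, outerVal) */
    return count;
}
```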

Breadth-first Search Breadth-first search enumerates all of the elements of a graph, exploring 
elements closer to the starting point first. In this case, the graph is implicit, defined by a starting 
element and a generating function that returns the neighbors of a given element. 

// Lists for all elts, current, and next level
RoomyList* all = RoomyList_make("allLev", eltSize);
RoomyList* cur = RoomyList_make("lev0", eltSize);
RoomyList* next = RoomyList_make("lev1", eltSize);

// Function to produce next level from current
void genNext(T elt) {
    /* User-defined code to compute neighbors ... */
    for nbr in neighbors
        RoomyList_add(next, nbr);
}

// Add start element, then complete the delayed adds
RoomyList_add(all, startElt);
RoomyList_add(cur, startElt);
RoomyList_sync(all);
RoomyList_sync(cur);

// Generate levels until no new states are found
while (RoomyList_size(cur)) {
    // generate next level from current
    RoomyList_map(cur, genNext);
    RoomyList_sync(next);

    // detect duplicates within next level
    RoomyList_removeDupes(next);

    // detect duplicates from previous levels
    RoomyList_removeAll(next, all);

    // record new elements
    RoomyList_addAll(all, next);

    // rotate levels
    RoomyList_destroy(cur);
    cur = next;
    next = RoomyList_make(levName, eltSize);
}

One of the initial tests of Roomy was to use breadth-first search to solve the pancake sorting 
problem. Pancake sorting operates using a sequence of prefix reversals (reversing the order of 
the first k elements of the sequence). The sequence can be thought of as a stack of pancakes of 
varying sizes, with the prefix reversal corresponding to flipping the top k pancakes. The goal of 
the computation is to determine the number of reversals required to sort any sequence of length n. 

Using Roomy, the entire application took less than one day of programming and less than 
200 lines of code. Breadth-first search was implemented using a RoomyArray, similar to the 
RoomyList-based version presented above. 
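
For small stacks, the same level-by-level search fits in RAM and can serve as a reference point. The following sketch is not the Roomy application described above: it packs each stack into a 32-bit integer (3 bits per pancake, an assumption that limits it to 8 pancakes) and replaces the removeDupes and removeAll steps with a single `seen` table. The names `flip` and `pancake_bfs_depth` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

enum { BITS = 3 };  /* bits per pancake; supports stacks of up to 8 */

/* Reverse the first k pancakes of a packed stack (pancake j is stored in
   bits [3j, 3j + 3)). */
static uint32_t flip(uint32_t s, int k) {
    uint32_t out = s;
    for (int j = 0; j < k; j++) {
        uint32_t v = (s >> (BITS * j)) & 7u;
        int dst = k - 1 - j;
        out &= ~(7u << (BITS * dst));
        out |= v << (BITS * dst);
    }
    return out;
}

/* Breadth-first search from the sorted stack of n pancakes. Returns the
   number of levels beyond the start, i.e. the largest number of flips
   separating any stack from the sorted one, or -1 on error. */
int pancake_bfs_depth(int n) {
    if (n < 1 || n > 8) return -1;
    size_t nperm = 1;
    for (int i = 2; i <= n; i++) nperm *= (size_t)i;
    unsigned char *seen = calloc((size_t)1 << (BITS * n), 1);
    uint32_t *cur = malloc(nperm * sizeof *cur);
    uint32_t *next = malloc(nperm * sizeof *next);
    if (!seen || !cur || !next) { free(seen); free(cur); free(next); return -1; }

    uint32_t start = 0;
    for (int j = 0; j < n; j++) start |= (uint32_t)j << (BITS * j);
    seen[start] = 1;
    cur[0] = start;
    size_t cur_n = 1;
    int depth = 0;

    for (;;) {
        size_t next_n = 0;
        for (size_t c = 0; c < cur_n; c++)
            for (int k = 2; k <= n; k++) {   /* all useful prefix reversals */
                uint32_t nbr = flip(cur[c], k);
                if (!seen[nbr]) { seen[nbr] = 1; next[next_n++] = nbr; }
            }
        if (next_n == 0) break;              /* no new states: search is done */
        uint32_t *tmp = cur; cur = next; next = tmp;   /* rotate levels */
        cur_n = next_n;
        depth++;
    }
    free(seen); free(cur); free(next);
    return depth;
}
```

The `seen` table is exactly the kind of randomly accessed structure that stops fitting in RAM as n grows, which is where the batched list operations of the disk-based version take over.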

Three different solutions to the pancake sorting problem, each using one of the three Roomy
data structures, are available in the Roomy online documentation [1].